Sibling variation in polygenic traits and DNA recombination mapping with UK Biobank and IVF family data

We use UK Biobank and a unique IVF family dataset (including genotyped embryos) to investigate sibling variation in both phenotype and genotype. We compare phenotype (disease status, height, blood biomarkers) and genotype (polygenic scores, polygenic health index) distributions among siblings to those in the general population. As expected, the between-siblings standard deviation in polygenic scores is \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sqrt{2}$$\end{document}2 times smaller than in the general population, but variation is still significant. As previously demonstrated, this allows for substantial benefit from polygenic screening in IVF. Differences in sibling genotypes result from distinct recombination patterns in sexual reproduction. We develop a novel sibling-pair method for detection of recombination breaks via statistical discontinuities. The new method is used to construct a dataset of 1.44 million recombination events which may be useful in further study of meiosis.

1. The sibling distribution of (additive) polygenic scores (PGS) for such traits is centered at the parental midpoint 2. The sibling genetic variance is roughly half that found in the general population (i.e., of individuals who are not closely related to each other). The probability distribution of sibling polygenic scores has standard deviation √ 2 times smaller than in the general population 39 .
Whether observed phenotypes also exhibit these properties depends further on the nature of environmental influences, and interactions between genotype and environment 40 . If environmental influences are themselves roughly additive, and mostly independent of genotype (and of parental background), we expect that sibling phenotypes are also normally distributed, roughly about the parental midpoint, with reduced variance compared to that found in the general population. The degree of reduction in variance depends on the heritability of the trait in question. We examine the properties discussed above from an empirical (or descriptive) perspective, using sibling and family data from UK Biobank (UKB) and from in-vitro fertilization (IVF) clinics. In the latter case routine embryo biopsies taken for genetic screening purposes have been genotyped by the company Genomic Prediction (GP), using an Affymetrix array similar to the one used by UKB. Specifically, we examine variance in and Figure 1. Trait and genetic variation among siblings is smaller than in the general population, but still significant. Population genetics theory implies that the standard deviation in sibling PGS is √ 2 smaller than for the general population. We demonstrate this empirically using UKB and IVF clinic data, and illustrate that variation in important traits and disease risks among siblings is substantial. Note that the exact √ 2 relation only holds at the phenotypic level if the trait is 100% heritable. www.nature.com/scientificreports/ reproduction. We develop a new sibling-pair method for detecting recombination events by comparing the genotypes of two siblings. Large biobanks such as UKB have tens of thousands of sibling pairs, which allow highstatistics mapping of recombination rates in sexual reproduction. The sibling-pair method makes use of the following observation (see "Detection of recombination events using sibling pairs" for more details). Two siblings, sib1 and sib2, will have identical DNA at almost every locus in certain regions (i.e., identical by descent or IBD) which, in total, amount to roughly one fourth of their entire genome. The boundaries of these regions are easy to detect: crossing such a boundary causes the average pairwise similarity to drop from nearly 100% to a much lower value. We can search for adjacent blocks of SNPs with very different pairwise (sib1 vs sib2) similarity. The boundary separating these blocks is a recombination event, and the distribution of these boundaries characterizes the recombination rate as a function of position on each chromosome.
Note that while these regions where sib1 and sib2 have exactly the same DNA sequence are particularly easy to detect and analyze (i.e., to find where they begin and end), nevertheless the recombination events themselves, which take place in the meiosis that produces a specific sperm or egg cell, are obviously independent of each other (sperm vs egg, or sib1 vs sib2). Hence the location of each boundary of these special regions is an unbiased and independent draw from the probability distribution which characterizes recombination. Cataloging the location of these boundaries therefore allows us to estimate this probability distribution-i.e., the recombination rate (in Morgans) as a function of position on each chromosome. Because this method of recombination mapping only requires the genotypes of two siblings, it can be applied to very large existing datasets.
As a check on the method, we test it using datasets in which we have access to parental as well as sibling genotypes. IVF family data is especially useful here because the number of embryos is typically larger than the number of siblings in a family, and the number of pairwise analyses one can use for testing is large. By specifically looking at SNPs where the mother and father are homozygous and heterozygous (or vice versa), we can easily track recombination events which switch the chromatid region that is passed on to the child. We have used smaller datasets with both parent and sibling data to test and calibrate the sibling-only method described above. See "Detection of recombination events" and "Detection of recombination events using sibling pairs" for more details.
We have developed the sibling-pair method for detecting recombination breaks using imputed SNP genotypes. This limits the spatial resolution of the results relative to what can be obtained using whole genomes, however it is sufficient to map out the probability distribution and to locate recombination hotspots (i.e., where the local recombination rate is much larger than the average for the chromosome). In this paper we apply the method to imputed SNP data from UKB and from an IVF family dataset.
The sibling-pair method can also be applied to whole genome data, which would result in more precise spatial resolution. We anticipate that large datasets of recombination events can be analyzed using machine learning techniques to identify the specific DNA features (e.g., motifs) which are associated with higher probability of recombination in a given region.
In this paper we use a number of different phenotypes and PGS, based on previous publications 3,27,43,44 , on siblings and unrelated populations. See the Supplement for additional information on how the phenotypes are specified and the PGS are trained. In addition to the PGS, there is a Genetic Health Index, which is a sum of 20 PGS-derived disease risks weighted by life expectancy impact of each disease. The specific index used in this work is described in 43 , including details of its construction and properties. Additional information about biobanks can be found in "UK Biobank and GP Biobank" and in the Supplement.

Results
Modern developments such as large-scale genetic biobanks and PGS allow us to explore a number of quantities that arise in classical population genetics. We present such results in this section. In addition, we correct some misconceptions found in the literature 42 concerning within family variation of PGS/phenotype compared to the general population.
Overview of methods. Some basic explanation of the notation is required before presenting the results.
The methods themselves, including details about recombination, are described in "Methods and data".
We repeatedly compare the PGS differences among unrelated individuals and siblings. U denotes the difference, or the distribution thereof, between unrelated individuals paired randomly. That is, the PGS difference between two randomly chosen individuals. S denotes the difference (or distribution of differences) between pairs of genetic siblings.
We extract the assortativity m from the covariance between the PGS (trained on phenotypes adjusted for common covariates like age and sex) of two siblings, S 1 , S 2 , and the general PGS variance V according to We compute the narrow-sense heritability h 2 as the average of two methods: (1) the ratio between PGS variance and phenotypic variance, and (2) the square of the correlation between PGS and phenotypes.
Lastly, we also present results for a Genetic Health Index, which is a linear combination of absolute risks estimated from PGS 43 . Expressed in terms of the average lifespan reduction l d , average lifetime risk ρ d and PGSestimated individual risk r d for each disease d , the Genetic Health Index is Sibling vs unrelated PGS results in UK Biobank. The results discussed here provide an empirical exploration of the theoretical predictions given in "Polygenic scores and classical results from population genetics". We identify the set of sibling pairs of self-reported European (White, in UKB terminology) ancestry within the UKB. We construct a set of randomly paired individuals: each individual is assigned to a pair with the partner chosen at random from the aggregate sibling group. We then calculate the phenotype difference between sibling pairs and random pairs with the expectation that the sibling pairs will display less variation that that of the random pairs. In Fig. 2, we show the difference in phenotype versus the difference in PGS between sibling pairs and random pairs for height and BMI. Figure 2 exhibits the similarity of sibling pairs in phenotype and PGS. Additionally several quantities can be extracted from sibling data. The narrow-sense heritability captured by a polygenic predictor can be estimated by taking the ratio of phenotype-to-PGS variation or by computing phenotype-to-PGS correlation (see Eqs. (3) and (5)). We present the average of these two methods. The correlation of trait values between sibling pairs and total additive variance can be used to estimate assortivity, see Eq. (6).
In Table 1, for quantitative phenotypes we list PGS and phenotype correlations between siblings, the narrowsense heritability, and the implied assortivity. We also compare the standard deviations found in sibling (S) and unrelated (U) distributions to test whether the former are smaller by √ 2. For case-control phenotypes, we can still observe the PGS difference and correlation between siblings in PGS and also demonstrate that the variance in scores between siblings is quite substantial. In Fig. 3, we display the difference in sibling vs random pairs for PGS in Type 2 Diabetes; a corresponding figure for genetic health index (see Eq. (12)) is shown in Fig. 4. In Table 2, we compute PGS correlations, implied assortativity m, and test the √ 2 relationship between sibling (S) and unrelated (U) standard deviation in PGS. The interpretation of m here (naively, negative assortation on certain disease risk polygenic scores) is unclear to us and deserves further investigation.
These calculations summarize how the collection of sibling pairs compares to the collection of unrelated pairings. We now explicitly show that individual families display a level of variance comparable to what is claimed. In the prior plots and tables, we focused solely on pairs of (full) siblings-however, in the UK Biobank, larger family structures are present. We identify larger families by collecting all sets such that if A-B and B-C are full sibling pairs, the superset A-B-C must be a family (we filter against parent-child relationships). The number and size of families found this way is described in the Supplement. For the largest families in the UK Biobank, we display the intra-family PGS distribution alongside the population PGS distribution in Fig. 5. Figures 3 and 5 show that some of the variation in the population PGS variance is due to inter-family variation and that the intra-family variation is reduced in comparison. For illustration we use the Type 2 Diabetes polygenic risk score. In Fig. 6, we center PGS scores by family around the average PGS of its members. For the largest families, we display the population distribution alongside the centered PGS distribution and the re-centered distribution for each family. This demonstrates that, within a particular family, difference in intrafamily polygenic score has close to 1/2 the variance of the general population which includes both an intra-and inter-family variation.

Figure 2.
Reduced, but still significant, variation among siblings relative to general population. UKB sibling pair vs unrelated pair differences in phenotype (vertical axis) and corresponding difference in PGS (horizontal axis). The left and right panels are height and BMI, respectively.  Fig. 7, we display the general UK Biobank PGS distribution, the GP parent-embryo distribution, and the scores of individuals (organized by family) which make up the parent-embryo cohort.  www.nature.com/scientificreports/ As before, the variation in the cohort has an inter-and intra-family component. However, for this cohort we have access to the PGS of the parents for each embryo. We next demonstrate that distribution of embryo PGS is distributed around the midparent value which leads to the inter-family variation. In Fig. 8, we display the general parent-embryo distribution alongside the midparent-centered parent-embryo distribution, along with . Health Index Polygenic Score: reduced, but still significant, variation among siblings relative to general population. For UKB sibling pair vs unrelated pair differences in Health Index we also find √ 2 reduction in standard deviations of PGS differences for siblings, as indicated by the dashed lines. www.nature.com/scientificreports/ the centered family in the parent-embryo cohort. Note that even when parental PGS differences are small, the variation within a set of embryos remains substantial. Figure 9 shows the Genetic Health Index score (as described in 43 ) distributions in the UK Biobank compared to GP parental (saliva) and embryo samples. Figures 10 and 11 use the Health Index in place of the Type 2 Diabetes polygenic score, displaying distributions within the UK Biobank and the parent-embryo cohort, as well as individual scores organized by family. Health Index is a nonlinear function of genotype (as we discuss further below) so the √ 2 relationship between general population and within-family standard deviation does not necessarily hold.
Among the families displayed in these figures, at position number 15 from the left, we encounter an interesting case of sibling polygenic distribution relative to the parents. In the family all siblings have significantly higher Health Index score than the parents. This arises in an interesting manner: the mother is a high-risk outlier for condition X and the father is a high-risk outlier for condition Y. (We do not specify X and Y, out of an abundance of caution for privacy, although the patients have consented that such information could be shared.) Their lower overall Health Index scores result from high risk of conditions X (mother) and Y (father). However, the embryos, each resulting from unique recombination of parental genotypes, are normal risk for both X and Y and each  www.nature.com/scientificreports/ embryo has much higher Health Index score than the parents. This can be discerned from detailed examination of individual genomes and associated polygenic scores. It is worth noting that the Health Index sums absolute risk (probability of getting the condition) per condition times lifespan impact. Absolute risk is a nonlinear function of polygenic score, whereas all polygenic scores used in this paper (and almost all in the published literature) are linear in the genotype. In this case the absolute risks of the parents for X (mother) and Y (father) are very high. Embryo genomes, due to the diploid nature of human sexual reproduction, are a blend of maternal and paternal genotypes. In this case all of the polygenic scores of the embryos cluster around the parental midpoint values for both X and Y. Due to nonlinearity, the absolute risks for X and Y of the embryos are all much lower than for the parents, and the embryos have higher Health Index scores than the parents.
Detection of recombination events. Results are obtained using the method described in "Detection of recombination events using sibling pairs", which identifies recombination events using statistical properties of paired sibling genotypes. We apply the algorithm of "Detection of recombination events using sibling pairs" to imputed UKB data. In Fig. 12, we display the cumulative number of breaks from the sibling algorithm compared to the results from deCODE 47 for chromosomes 1 and 22. Figure 13 are histograms of breaks by location on       www.nature.com/scientificreports/ It is important to note that even slightly inaccurate maps of recombination breaks are still useful for machine learning methods which attempt to predict local recombination rate based on DNA motifs in the region. From the machine learning perspective, the impact of modest inaccuracies is simply a small decrease in statistical signal. The over-representation of certain motifs or specific mutations in hotspot regions should still be detectable by automated learning methods. The primary benefit from the sibling-pair method is a very large expansion in the size (and ancestry group coverage) of available maps (catalogs) of recombination events across the genome.
In Table 3, we summarize cumulative results for each chromosome generated from the sibling method and from deCODE. Note that deCODE provides the cM length as the average over parents-which results in 1/2 the number of expected breaks-while the sibling-pair method is designed to detect 1/2 of all breaks (i.e., the boundaries between type 1 and type 2 regions; see "Methods and data").
Although the UKB cohort is predominantly self-reported European, there are a non-trivial number of sibling pairs with other self-reported ancestry. The question of variation in recombination rate by self-reported ancestry group can be investigated, at least at a preliminary level, using the non-European siblings found in UKB. Application of the new method to UKB sibling pairs yields the following whole genome recombination lengths for the three largest (self-reported) ancestry groups: Europeans (21, 514 pairs) 3338 cM, South Asian (292 pairs) 3073 cM, African (174 pairs) 3453 cM. If we adopt the European distribution as the null hypothesis-mean 33.38 breaks per individual in a pair, SD = 4.8-we obtain the following results: (1) the South Asian deviation in average is 2.65 fewer breaks (with N = 292). In null model SD units this is 2.65/(4.8/ √ 292) = 9.43 , so statistically significant. (2) The African deviation in average is 1.15 more breaks (N = 174). In null model SD units this is 1.15/(4.8/ √ 174) = 3.16 , so statistically significant. We plan to investigate this topic in the future with more non-European sibling data as it becomes available. It is possible that the differences in total recombination rates found above are the result of some kind of systematic error. For example, the sibling-pair method might exhibit higher or lower recombination detection rates due to differences in how genome imputation works across different ancestry groups.

Discussion
Roughly half of the genetic variation in a population is within-family, and the other half between families. This implies that the standard deviation of sibling polygenic score distributions is roughly √ 2 times smaller than for the general population. Siblings are more similar to each other than randomly selected individuals in the population, but there is still significant variation within each family. We have explored these, and related, consequences of simple population genetics using the power of modern genotyping, large biobanks, and a unique dataset of IVF families.
In earlier work it was shown that PGS perform almost as well in differentiating siblings by disease status or quantitative trait value as in pairs of unrelated individuals 27,44 . It was also shown that expected gains from selection on PGS or Health Index score are substantial 43 . For the latter result to hold there must be significant variance among siblings on the relevant PGS. In this paper we exhibited this variation using many traits.
We have also developed a new technique that directly probes DNA recombination-the molecular mechanism responsible for sibling genetic differences. This new sibling-pair method can be applied to large datasets with many thousands of sibling pairs. In this project we created a map of roughly 1.44 million recombination events using UKB genomes. Similar maps can now be created using other biobank data, including in non-European ancestry groups that have not yet received sufficient attention. The landmark deCODE results were obtained under special circumstances: the researchers had access to data resulting from a nationwide project utilizing genealogical records (unusually prevalent in Iceland) and widespread sequencing. Using the sibling-pair method results of comparable accuracy can be obtained from existing datasets around the world-e.g., national biobanks in countries such as the USA, Estonia, China, Taiwan, Japan, etc. www.nature.com/scientificreports/ It seems possible that deep learning approaches using this training data could identify new DNA motifs that are associated with increased or decreased local recombination probability. This may advance our molecular understanding of meiosis, a fundamental biological process. Regarding applications, we note that many chromosomal abnormalities are the result of errors introduced through recombination. Improvements in computational analysis of embryo genotypes may result in better accuracy in embryo screening against aneuploidy.

Methods and data
UK Biobank and GP Biobank. The two datasets used in this work are the 2018 release of the UKB and a cohort of parent and embryo genotypes from families undergoing preimplatation genetic testing for polygenic (PGT-P) screening at GP.
The GP samples are processed on the same (Affymetrix) SNP array platform which is used by the UKB. The details of the genotyping procedure and quality control are described at length in 48 . We restrict all samples to those with European ancestry with a total of 32 individual parent-embryo groups. Pre-implantation genetic testing for aneuploidy was carried out on all embryos and only euploid embryos were considered for analysis in this work. GP data was collected along internal GP guidelines, conforming with local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the use of any potentially identifiable data included in this article, although no identifiable information was used, nor made available to researchers.
UKB data was collected under policies conforming with national and requirements and subject to privacy rights (https:// www. ukbio bank. ac. uk/ priva cy-policy). All data handled by researchers in this work was de-identified.
MSU researchers working on this project do not have access to any identifiable data and are covered under MSU IRB [STUDY00006493].
For the UK Biobank, all of our samples are restricted to self-reported European (in UKB terminology, white) ancestry unless otherwise stated. The details of the genotyping and quality control steps are found in the supplements.
Polygenic predictors are obtained through training on either the UK Biobank or a wholly separate GWAS. The testing set we use is the set of 21 k sibling pairs in the UK Biobank ( ∼ 40 k individuals)-all of whom have been withheld from predictor training and tuning. The majority of predictors are trained using the LASSO algorithm in scikit-learn. The quantitative trait predictors are described in detail in 3,27 , while case-control predictors are described in 44 . Several predictors are trained using PRS-CS and publicly available GWAS which have a much higher case-count than available in the UK Biobank. Details of the training are described in the Supplement of 43 and we provide a brief summary in the Supplement.
Polygenic scores and classical results from population genetics. In this section, we review some classical results of population genetics with an emphasis on how they relate to polygenic scores. A phenotype is typically modeled (assuming no G × E interactions) as a linear combination of genetic and environmental effects as where G, E are the genetic and environmental components. The genetic contribution to a particular phenotype is known as the heritability and is defined as where V denotes the additive variance. The genomic variance is typically separated into additive, dominance and epistasis components. We are from now on focusing on the additive component only and denote it as V, without any subscripts. The linear component of heritability is referred to as the narrow-sense heritability and is given by i.e., it is just the additive component of variance. In this work, we assume additive models such that the genetic contribution to the phenotype of an individual i is where β j is the effect size for a given SNP j. With this definition we can estimate the narrow-sense heritability by a second method In Table 1, the average of both methods is recorded. The covariance between sibling polygenic scores S 1 and S 2 can be calculated, assuming additivity, as Here m is the assortivity, or the degree to which mates resemble each other on this specific polygenic score ( m = 0 for random mating) 49 .
For a given pair of siblings the sum over genetic loci can be separated into the subset which are IBD and another subset which vary randomly as in the general population (neglecting assortation): so that the difference in sibling values is Since on average half the SNPs are IBD and that variance is additive, the variance in the difference between siblings S , which arises from the loci that are not IBD will be half that found for the difference between unrelated individuals U: Again, we have neglected assortation. Slight deviations from this result are observed for some traits for which assortation is significant. Consequently the standard deviation in the distribution of sibling differences will be √ 2 times smaller than for unrelated pairs of individuals. Similarly, one can show that the distribution of sibling PGS values, which is centered at the parental midpoint, has standard deviation which is √ 2 times smaller than the distribution of PGS values in the general population 50 . Similar results can be found in Eq. (2) of a more comprehensive review in 51 and empirically for Crohn's Disease, Schizophrenia and Autism in references 52,53 .
Note in Figs. 5 and 6 (based on UKB data) we do not have access to parental midpoint values and hence plot the sibling distribution relative to the average among the available sibs. This is subtly different from recentering using parental midpoint (the average among sibs only approaches parental midpoint when number of sibs is large, and in the UK data this is usually not the case), and further reduces the variance of the resulting distribution. The √ 2 effect (reduction in variance among sibs) from population genetics is easier to evaluate using differences in sibling scores, because parental midpoint is not required.
Binary traits and the Genetic Health Index. The discussion until this point has assumed quantitative phenotypes. For case-control phenotypes, several approaches exist in the literature. For the context of this work, the mixed-distribution model of polygenic scores is adopted where both cases/controls follow normal distributions but are shifted from one-another. This is written as where x is the PGS, µ 0/1 is the mean PGS of controls/cases, σ is the standard deviation of PGS among the controls (The standard deviations for cases and controls do not have to be equal in principle but are usually very close in practice. The model assumes them equal and we use σ 0 due to better statistics.) and π is the lifetime risk. The mixed-distribution model can be translated into risk as follows With this model of risk, we can construct a quantity which describes expected lifespan gain for an individual i given by where ρ d is the average lifetime risk of a particular disease d and l d is the average lifespan reduction of that disease in units measured in years (life-years, LY). This quantity is referred to as a Genetic Health Index. Clearly this is only a proxy for lifespan gain as it does not account for disease covariances and cannot possibly include lifespan impacts from all phenotypes. For further discussion about properties and behavior of this Genetic Health Index, see 43 . Detection of recombination events using sibling pairs. Here we describe a novel algorithm for detecting recombination events using sibling pairs. This sibling-pair method does not require parental genotypes. Let s 1 and s 2 be the two sibling genotypes. Consider a specific SNP location and define two DNA windows of size N flanking this locus (Fig. 14). (We may refer to one window w − on the "left" of the reference SNP and the other w + to the "right".) The algorithm computes the similarity of genotypes s 1 and s 2 in each of the flanking windows. For example, the similarity could be defined as simply the number N ± of loci in each window which (10) φ(x) = (1 − π)N (x, µ 0 , σ ) + πN (x, µ 1 , σ ) , www.nature.com/scientificreports/ are identical in state in both siblings ( N ± are each numbers between 0 and N). We can then define an indicator function Z(N + , N − ) (see below) which is used to detect discontinuities in similarity, comparing w + with w − . For a given window, there are 3 distinct types of region for a sibling pair. Let the father's chromosome strands (chromatids) be Y a , Y b and the mother's chromosome strands be X a , X b , and adopt the notation s 1 : s 2 to describe the genotype of each of the siblings at the same position. Then, the state of a specific region in the sibling pair must realize one of the following possibilities: From these observations, roughly one quarter of a sibling's genome will match identically with the other. A recombination event can then be detected by scanning along the chromosome for regions in which both siblings agree completely (modulo rare genotyping errors) in w − but differ from identical in w + . The location at which the identical (type 1 region) to non-identical (type 2 region) change happens is a recombination event.
Almost all breaks will be transitions from region type 1 to type 2 or region type 2 to type 3 because a jump from type 1 to type 3 would require a coincidence in meiosis between both the sperm and egg, which is highly unlikely. The sibling-pair method is designed to detect breaks from type 1 to 2 (or, in principle, 1 to 3) regions, but will not record changes from type 2 to 3. Therefore the sibling-pair method will detect 1/2 the total number of breaks.
We define an indicator function Z which signals discontinuities in statistical properties of flanking windows: where a is a positive number much less than N, which keeps the denominator from ever becoming zero. This function has useful properties: 1. The numerator is large in magnitude when the s 1 : s 2 similarities in the two flanking windows are very different (e.g., if one window is in a type 1 region and the other is not). 2. The denominator is small when at least one of the flanking windows has close to maximal similarity N, which tends to occur in type 1 regions.
Z reaches a local maximum near a recombination break point, with Z > 0 indicating a transition from type 1 to 2 region (from "left" to "right"), while Z < 0 indicates a transition from type 2 to type 1 region. The parameters N, a are tuned, and performance evaluated, using family data (both IVF and from UKB) in which parental genotypes are available. The parameters are chosen so that Z is robust against genotyping errors or mutations (which reduce similarity of IBD regions from 1), and clusters of low MAF SNPs (which can cause non-IBD regions to appear IBD). With parental genotypes we can make use of hetero-and homozygous loci to directly track which segments of chromatid have been passed to each sibling.
What is novel here is that recombination breaks can be detected using sibling pair data alone, once the indicator function Z has been properly adjusted. Many biobanks, including UKB, have large numbers of siblings in their cohort, whereas trios containing mother, father, and child are comparatively rare. Thus, this new technique will add significantly to the amount of data available for study of recombination. In this work, we create a dataset of 1.44 million recombination events found in 21 k UKB sibling pairs. We anticipate that datasets like ours will be useful in machine learning to discover the molecular mechanisms which control the local rate of recombination in the genome. For example, deep learning could be used to detect specific DNA motifs which are associated with recombination hotspots or regions where recombination is rare.
Note we assume that each recombination event can be considered as a draw from a probability distribution. Effects such as crossover interference (which are consequences of the underlying molecular mechanisms) cause (13) Z = (N + − N − ) (a + (N − max(N + , N − )) , Figure 14. The recombination algorithm detects breaks between regions in which both siblings have identical SNPs and regions where they differ. SNP genotypes for sibling 1 and 2 are shown as counts of the major alleles. The recombination break occurs at the boundary between regions. On the left both siblings have inherited the same chromatid strands from the parents. A recombination break, which could occur in either the egg or the sperm progenitor of either sibling, causes mismatches in the SNP genotypes in the region on the right-i.e., on the right the siblings no longer have the same chromatid strands.
Scientific Reports | (2023) 13:376 | https://doi.org/10.1038/s41598-023-27561-z www.nature.com/scientificreports/ breaks to be less likely to occur in close proximity to each other, violating the assumption that each individual break is an independent event. However, this is a small correction as with only 70 breaks across 22 chromosomes it is unlikely that two breaks occur close together. Since crossover interference does not inhibit a sib1 break conditional on a nearby sib2 break, the results from the sib-pair method detects very slightly more breaks in close proximity than occur in a single meiosis. However, as mentioned this is a very small effect as nearby breaks are rare.

Data availability
Access to the UK Biobank resource is available via application (http:// www. ukbio bank. ac. uk). The authors acknowledge acquisition of data sets via UK Biobank Main Application 15326. Genomic Prediction (GP) datasets generated for this study will not be made publicly available, in accordance with UKB and GP policy. UKB data was collected under policies conforming with national and local requirements and subject to privacy rights (https:// www. ukbio bank. ac. uk/ priva cy-policy). All data handled by researchers in this work was de-identified. GP data was collected along internal GP guidelines, conforming with local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the use of any potentially identifiable data included in this article, although no identifiable information was used, nor made available to researchers. GP data is not available to researchers outside the company. MSU researchers working on this project do not have access to any identifiable data and are covered under MSU IRB [STUDY00006493]. The 1.44 million recombination events found using the UKB siblings are available for use by other researchers www. github. com/ MSU-Hsu-Lab/ sibli ng-varia tion-paper-2022.